Main questions we want answers to:

  1. Where are the most expensive neighbourhoods for listings? What are their prices like?
  2. What kind of properties are these listings (type, bedroom count, bathroom count, etc.)?
  3. What is the best neighbourhood for a given budget and set of other preferences?
  4. Even though reviews are subjective, where and with whom should I stay to increase my chances of having a good experience? In a cheap or an expensive neighbourhood?
  5. Does the host play a role in my experience?
In [309]:
# imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import plotly.express as px
import seaborn as sns
import json
import ipywidgets as widgets
%matplotlib inline

# my favorite
plt.style.use("fivethirtyeight")

# show full columns
pd.set_option('display.max_columns', None)

# cell width 
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

Datasets available to us: listings, reviews, geographical information

In [422]:
# listings data: summary file and detailed file
ls = pd.read_csv("../data/listings.csv")
# low_memory=False reads each column in one pass, avoiding the mixed-type DtypeWarning
ls_d = pd.read_csv("../data/listings 2.csv", low_memory=False)


# reviews data
rs = pd.read_csv("../data/reviews.csv")
rs_d = pd.read_csv("../data/reviews 2.csv")


# geography data
geo = pd.read_csv("../data/neighbourhoods.csv")

with open("../data/neighbourhoods.geojson") as jsonfile:
    geojson = json.load(jsonfile)

In [315]:
geo.head()
Out[315]:
neighbourhood_group neighbourhood
0 NaN Brightwood Park, Crestwood, Petworth
1 NaN Brookland, Brentwood, Langdon
2 NaN Capitol Hill, Lincoln Park
3 NaN Capitol View, Marshall Heights, Benning Heights
4 NaN Cathedral Heights, McLean Gardens, Glover Park

Preliminary: What are the neighbourhoods in DC? How many listings are in each?

In [360]:
import plotly.express as px
# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.choropleth_mapbox(ls, geojson=geojson, color="neighbourhood", title="Washington D.C. Neighbourhood Map",
                           locations="neighbourhood", featureidkey="properties.neighbourhood",opacity=0.5, color_discrete_sequence=px.colors.qualitative.Light24,
                           center={"lat": 38.9072, "lon": -77.0369},
                           mapbox_style="light", zoom=11)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
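
The map shows where each neighbourhood lies but not how many listings it contains. A minimal sketch of the count, using a small made-up frame in place of `ls` (the rows below are illustrative, not taken from the data):

```python
import pandas as pd

# Hypothetical stand-in for `ls`; only the `neighbourhood` column matters here.
toy_ls = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "neighbourhood": [
        "Shaw, Logan Circle",
        "Capitol Hill, Lincoln Park",
        "Shaw, Logan Circle",
        "Shaw, Logan Circle",
        "Capitol Hill, Lincoln Park",
    ],
})

# value_counts() returns the number of listings per neighbourhood, largest first.
listing_counts = toy_ls["neighbourhood"].value_counts()
print(listing_counts)
```

On the real data the same call would be `ls["neighbourhood"].value_counts()`.
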
In [423]:
# melt listings data for graphing purposes 
melted_ls = pd.melt(ls,id_vars=['id', 'name', 'host_id', 'host_name', 'neighbourhood_group','neighbourhood', 'latitude', 'longitude'])
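
To see what the melt does, here is a small sketch on an invented two-listing frame (the column names mirror a few of those in `ls`, the values are made up). Every column not named in `id_vars` is stacked into `variable`/`value` pairs:

```python
import pandas as pd

# Hypothetical two-listing frame in wide format.
wide = pd.DataFrame({
    "id": [1, 2],
    "latitude": [38.91, 38.89],
    "longitude": [-77.02, -77.00],
    "price": [433, 75],
    "minimum_nights": [2, 7],
})

# Keep the identifying columns fixed and stack every other column
# into (variable, value) pairs: one row per listing per attribute.
long = pd.melt(wide, id_vars=["id", "latitude", "longitude"])
print(long)
```

This long layout is what lets a single `variable == attribute` filter pick out any attribute for plotting.
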
In [424]:
# create dropdown for attributes in melted dataframe
dropdown_attribute = widgets.Dropdown(options = sorted(melted_ls.variable.unique()))

# output
output = widgets.Output()


def view_attribute(attribute):
    
    # clear output for new attribute to be plotted 
    output.clear_output()
    
    # filter df to selected attribute
    filtered = melted_ls[melted_ls.variable == attribute]
    
    with output:
        # set token
        px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
        fig = px.scatter_mapbox(filtered.sample(frac=0.1), lat="latitude", lon="longitude", color="value", template="seaborn",
                                   opacity=0.5, color_continuous_scale="Viridis", 
                                   center={"lat": 38.9072, "lon": -77.0369},
                                   mapbox_style="streets", zoom=11)

        fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
        fig.show()

# change attribute
def dropdown_attribute_handler(change):
    view_attribute(change.new)
    
dropdown_attribute.observe(dropdown_attribute_handler, names="value")
display(dropdown_attribute)
display(output)

Question 1: Where are the most expensive neighbourhoods for listings? What are their prices like?

Here we can see that the most expensive neighbourhoods for listings form a crescent around D.C., starting from Georgetown going all the way south to Capitol Hill/Southwest Waterfront.

In [425]:
# bin prices into deciles so that outliers do not dominate a continuous colour scale
ls['Binned Price'] = pd.qcut(ls['price'], 10)

# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.scatter_mapbox(ls, lat="latitude", lon="longitude", color="Binned Price", template="simple_white", opacity=0.5, 
                         color_discrete_sequence=px.colors.diverging.RdYlGn, category_orders={"Binned Price":sorted(ls["Binned Price"].unique())}, hover_data=["neighbourhood", "price"],
                           center={"lat": 38.895, "lon": -77.024},
                           mapbox_style="basic", zoom=12)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
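
Why quantile bins rather than equal-width bins? A rough sketch on invented prices with one extreme value: equal-width bins put nearly everything in the first bin, while `pd.qcut` picks edges so that each bin holds the same number of listings.

```python
import pandas as pd

# Made-up nightly prices with one extreme value.
prices = pd.Series([40, 55, 70, 85, 100, 120, 150, 200, 350, 10000])

# Equal-width bins: the outlier stretches the range, so almost
# every listing lands in the first bin.
equal_width = pd.cut(prices, 5)

# Quantile bins: edges are chosen so each bin holds the same number of listings.
quantile_bins = pd.qcut(prices, 5)

print(equal_width.value_counts().sort_index())
print(quantile_bins.value_counts().sort_index())
```
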

If we look at the mean and median listing price per night across the neighbourhoods, we can see that outliers are also a factor.

In [434]:
# flag outliers: prices at least 3 standard deviations from the mean
ls['is_outlier'] = ((ls["price"] - ls["price"].mean()).abs() >= 3 * ls["price"].std()).astype(int)
# price with outliers set to 0 (they still count as listings in the neighbourhood mean)
ls['price non-outlier'] = ls["price"].where(ls['is_outlier'] != 1, 0)

#summary
summary = ls.groupby("neighbourhood").agg({'price':['mean', 'median'] ,'price non-outlier':'mean', 'is_outlier':['sum', 'count']}).reset_index()
summary.columns = ['Neighbourhood', 'Mean Price', 'Median Price', 'Mean Price of non-outliers','# of outliers', '# of listings']

# mean price by neighbourhood
(
    summary
    .sort_values(["Mean Price of non-outliers","# of listings"], ascending=False)
    .style.background_gradient(cmap='RdYlGn', subset=['Mean Price', 'Median Price', 'Mean Price of non-outliers'])
)
Out[434]:
Neighbourhood Mean Price Median Price Mean Price of non-outliers # of outliers # of listings
36 West End, Foggy Bottom, GWU 304.281 150 297.603 1 292
11 Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street 274.611 173 252.505 2 380
31 Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point 270.191 189.5 246.261 2 188
17 Georgetown, Burleith/Hillandale 296.938 162 211.031 6 259
4 Cathedral Heights, McLean Gardens, Glover Park 204.809 99.5 204.809 0 136
29 Shaw, Logan Circle 290.918 144.5 203.604 10 670
26 North Cleveland Park, Forest Hills, Van Ness 198.028 110 198.028 0 71
20 Howard University, Le Droit Park, Cardozo/Shaw 333.939 130 197.448 7 359
2 Capitol Hill, Lincoln Park 260.604 135 189.688 16 834
35 Union Station, Stanton Park, Kingman Park 220.582 125 187.69 12 874
32 Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir 288.105 122.5 181.128 3 86
5 Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace 199.611 115 178.372 1 113
25 Near Southeast, Navy Yard 177.556 131 177.556 0 81
12 Dupont Circle, Connecticut Avenue/K Street 250.692 142.5 175.382 11 712
14 Edgewood, Bloomingdale, Truxton Circle, Eckington 168.55 100 159.284 3 723
18 Hawthorne, Barnaby Woods, Chevy Chase 235.377 100 159 2 61
22 Kalorama Heights, Adams Morgan, Lanier Heights 174.108 119 147.339 5 407
7 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View 148.453 99 143.97 2 892
10 Douglas, Shipley Terrace 143.71 65 143.71 0 31
19 Historic Anacostia 143.361 75 143.361 0 61
34 Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont 140.739 80 140.739 0 119
16 Friendship Heights, American University Park, Tenleytown 136.488 99 136.488 0 86
0 Brightwood Park, Crestwood, Petworth 144.828 88 132.824 3 529
15 Fairfax Village, Naylor Gardens, Hillcrest, Summit Park 128.636 99 128.636 0 33
6 Colonial Village, Shepherd Park, North Portal Estates 127.975 86 127.975 0 40
8 Congress Heights, Bellevue, Washington Highlands 166.463 92 123.171 2 82
37 Woodland/Fort Stanton, Garfield Heights, Knox Hill 120.556 69 120.556 0 9
21 Ivy City, Arboretum, Trinidad, Carver Langston 127.267 95 119.752 1 266
3 Capitol View, Marshall Heights, Benning Heights 118.04 70 118.04 0 75
28 River Terrace, Benning, Greenway, Dupont Park 179.056 70 117.944 1 54
38 Woodridge, Fort Lincoln, Gateway 111.846 80 111.846 0 65
27 North Michigan Park, Michigan Park, University Heights 110.24 75 110.24 0 96
13 Eastland Gardens, Kenilworth 108.429 70 108.429 0 14
1 Brookland, Brentwood, Langdon 118.193 79.5 108.325 1 166
23 Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill 131.128 75 101.214 1 117
33 Takoma, Brightwood, Manor Park 96.852 68 96.852 0 196
24 Mayfair, Hillbrook, Mahaning Heights 94.4915 65 94.4915 0 59
30 Sheridan, Barry Farm, Buena Vista 136.478 72 93 1 46
9 Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights 92.766 75 92.766 0 47
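
Note that `price non-outlier` replaces outlier prices with 0 rather than dropping them, so the "Mean Price of non-outliers" column still divides by every listing in the neighbourhood. A small sketch of the difference, on made-up numbers and with a hypothetical fixed threshold standing in for the 3-standard-deviation rule:

```python
import numpy as np
import pandas as pd

# Hypothetical prices for one neighbourhood; 5000 plays the outlier.
toy = pd.DataFrame({"price": [100, 120, 140, 5000]})
is_outlier = toy["price"] >= 1000  # stand-in threshold, not the 3-sigma rule

# Zeroing the outlier keeps it in the denominator: (100 + 120 + 140 + 0) / 4.
mean_with_zero = toy["price"].where(~is_outlier, 0).mean()

# Masking it as NaN drops it, since mean() skips NaN: (100 + 120 + 140) / 3.
mean_with_nan = toy["price"].where(~is_outlier, np.nan).mean()

print(mean_with_zero, mean_with_nan)
```

If the intent is the average over non-outlier listings only, the NaN variant may be the closer match; the ranking above was produced with the zero-filled version.
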
In [454]:
hist_data.head()
Out[454]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 Binned Price is_outlier price non-outlier
0 3362 Convention Center Rowhouse & In Law: 2 Units, 4BR 2798 Ayeh NaN Shaw, Logan Circle 38.90982 -77.02016 Entire home/apt 433 2 177 2020-02-02 1.31 5 138 (350.0, 10000.0] 0 433
2 3670 Beautiful Sun-Lit U Street 1BR/1BA 4630 Sheila NaN Howard University, Le Droit Park, Cardozo/Shaw 38.91842 -77.02750 Private room 75 2 79 2018-07-25 1.31 1 0 (70.0, 85.0] 0 75
6 4197 Bedroom in DC 2 blocks to Metro 5061 Sandra NaN Capitol Hill, Lincoln Park 38.88791 -76.99668 Private room 83 7 44 2020-01-25 0.34 2 225 (70.0, 85.0] 0 83
7 4501 DC Rowhouse 1585 Kip NaN Shaw, Logan Circle 38.91331 -77.02436 Private room 475 2 120 2015-07-16 0.89 1 0 (350.0, 10000.0] 0 475
14 11785 Sanctuary near Cathedral 32015 Teresa NaN Cathedral Heights, McLean Gardens, Glover Park 38.92828 -77.07638 Entire home/apt 125 1 386 2020-02-16 3.13 4 348 (115.0, 135.0] 0 125
In [456]:
most_expensive_neighbourhoods 
Out[456]:
['West End, Foggy Bottom, GWU',
 'Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street',
 'Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point',
 'Georgetown, Burleith/Hillandale',
 'Cathedral Heights, McLean Gardens, Glover Park',
 'Shaw, Logan Circle',
 'North Cleveland Park, Forest Hills, Van Ness',
 'Howard University, Le Droit Park, Cardozo/Shaw',
 'Capitol Hill, Lincoln Park',
 'Union Station, Stanton Park, Kingman Park']
In [484]:
dfs = []
for x in most_expensive_neighbourhoods:
    dfs.append(hist_data[hist_data.neighbourhood == x]['price'])

General Summary for most expensive neighborhoods:

In [489]:
import plotly.figure_factory as ff

# top ten neighbourhoods
most_expensive_neighbourhoods = summary.sort_values(["Mean Price of non-outliers"], ascending=False)[:10].Neighbourhood.values.tolist()

# filter to the top ten neighbourhoods, excluding outliers
hist_data = ls[ls.neighbourhood.isin(most_expensive_neighbourhoods) & (np.abs(ls.price-ls.price.mean()) <= (3*ls.price.std()))]

colors = ['#755dd9','#ccccff' ,'#aed4ff' ,'#00fa9a' ,'#ffefd5' ,'#ccccff' ,'#ffdeaa' ,'#d3e7b1' ,'#f26964','#f4a889']

# Create distplot: histogram plus KDE curve (the default curve_type), without the rug plot
fig = ff.create_distplot(dfs, most_expensive_neighbourhoods, colors=colors,show_rug=False)

fig.update_layout(title_text='Hist and Curve Plot')
fig.show()
In [437]:
summary.sort_values(["Mean Price of non-outliers"], ascending=False)[:10].Neighbourhood.values
Out[437]:
array(['West End, Foggy Bottom, GWU',
       'Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street',
       'Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point',
       'Georgetown, Burleith/Hillandale',
       'Cathedral Heights, McLean Gardens, Glover Park',
       'Shaw, Logan Circle',
       'North Cleveland Park, Forest Hills, Van Ness',
       'Howard University, Le Droit Park, Cardozo/Shaw',
       'Capitol Hill, Lincoln Park',
       'Union Station, Stanton Park, Kingman Park'], dtype=object)

How expensive are these outliers? We can see that the upper percentiles start at about $2,000, with a few going all the way up to $10,000!

In [404]:
# clean price 
ls_d["price"] = ls_d["price"].str[1:].str.replace(",","").astype(float)

# detect outliers (93 total); the mask is built from ls.price, so this relies on ls and ls_d sharing the same row index
outliers = ls_d[~(np.abs(ls.price-ls.price.mean()) <= (3*ls.price.std()))]

# plot
plt.figure(figsize=(15, 10))
ax = sns.swarmplot(y="neighbourhood", x="price", data=outliers)
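
The price-cleaning step above assumes every price string starts with exactly one currency symbol. A slightly more defensive sketch strips `$` and `,` wherever they occur and lets `pd.to_numeric(errors="coerce")` turn anything unparseable into NaN instead of raising. The sample strings are invented:

```python
import pandas as pd

# Invented examples: with and without a leading "$", plus one unparseable value.
raw = pd.Series(["$1,250.00", "$75.00", "95.00", "n/a"])

# Remove currency symbols and thousands separators, then convert to float.
cleaned = pd.to_numeric(
    raw.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(cleaned)
```

With `.str[1:]`, a value such as `"95.00"` would silently lose its first digit, which is the case this variant guards against.
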

Question 2: What kind of properties are these listings (type, bedroom count, bathroom count, etc.)?

We can see that the majority of listings are entire homes and apartments, followed by private rooms, then shared rooms and hotel rooms.

In [376]:
# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.scatter_mapbox(ls_d.dropna(subset=['beds']), lat="latitude", lon="longitude", color="room_type", template="simple_white", size="beds",
                          hover_data=["neighbourhood", "price"],
                           center={"lat": 38.895, "lon": -77.024},
                           mapbox_style="basic", zoom=12)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
In [393]:
# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.scatter_mapbox(ls_d.dropna(subset=['beds']), lat="latitude", lon="longitude", color="property_type", template="simple_white", size="beds",
                          hover_data=["neighbourhood", "price"], color_discrete_sequence=px.colors.qualitative.Alphabet,
                           center={"lat": 38.895, "lon": -77.024},
                           mapbox_style="basic", zoom=12)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

Summary statistics by neighbourhood and room type

In [385]:
room_df = ls_d[['neighbourhood', 'price', 'bathrooms','bedrooms','beds', 'room_type']].copy()

room_df.groupby(["neighbourhood", "room_type"]).mean().reset_index().sort_values("price",ascending=False)
Out[385]:
neighbourhood room_type price bathrooms bedrooms beds
77 Downtown/Penn Quarter Hotel room 5089.500000 1.750 1.500000 1.500000
162 Logan Circle Shared room 3810.153846 1.000 1.000000 3.846154
245 U Street Corridor Shared room 2946.000000 3.875 1.000000 1.750000
56 Chevy Chase, MD Entire home/apt 1095.000000 4.500 5.000000 6.000000
25 Berkley Entire home/apt 1051.500000 3.000 2.833333 4.166667
... ... ... ... ... ... ...
203 Pleasant Plains Shared room 30.000000 1.000 1.000000 1.000000
29 Bloomingdale Shared room 26.000000 3.000 1.000000 1.000000
2 16th Street Heights Shared room 25.000000 1.000 1.000000 1.000000
174 Mount Pleasant Shared room 25.000000 1.000 1.000000 1.000000
148 Kingman Park Shared room 24.500000 1.250 1.000000 8.000000

257 rows × 6 columns
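
Group means near the top of a table like this can rest on very few listings, and a single extreme price is enough to push a mean into the thousands. One way to check, sketched here on invented rows, is to report the group size and median next to the mean using named aggregation:

```python
import pandas as pd

# Invented rows; the neighbourhood and room-type labels are illustrative only.
rooms = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "B"],
    "room_type": ["Hotel room", "Private room", "Private room",
                  "Private room", "Private room"],
    "price": [9999.0, 80.0, 90.0, 70.0, 75.0],
})

# Named aggregation: mean and median price plus the number of listings behind them.
stats = (
    rooms.groupby(["neighbourhood", "room_type"])
    .agg(mean_price=("price", "mean"),
         median_price=("price", "median"),
         n_listings=("price", "size"))
    .reset_index()
    .sort_values("mean_price", ascending=False)
)
print(stats)
```

Applied to `room_df`, the `n_listings` column would show at a glance which of the highest means come from only one or two listings.
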

In [391]:
len(ls_d.property_type.unique())
Out[391]:
23
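
With 23 distinct property types, a colour legend gets crowded. One option, sketched here on invented data with a hypothetical `min_count` cut-off, is to keep the common types and lump the rare ones into an "Other" category before plotting:

```python
import pandas as pd

# Invented property types; the real column would be ls_d["property_type"].
property_type = pd.Series(
    ["Apartment"] * 5 + ["House"] * 3 + ["Townhouse"] * 2 + ["Boat", "Tent"]
)

counts = property_type.value_counts()

# Keep types with at least `min_count` listings; lump the rest as "Other".
min_count = 2  # hypothetical cut-off, chosen for illustration
common = counts[counts >= min_count].index
lumped = property_type.where(property_type.isin(common), "Other")

print(lumped.value_counts())
```
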